Getting Started with gensim

This section introduces the basic concepts and terms needed to understand and use gensim and provides an example of a simple model.

Core Concepts and Simple Example

At a very high level, gensim is a tool for discovering the semantic structure of documents by examining the patterns of words (or higher-level structures such as entire sentences or documents). gensim accomplishes this by taking a corpus, producing a vector representation of the texts in the corpus, and using that vector representation to train a model. These three concepts are key to understanding how gensim works, so let's take a moment to explain what each of them means. At the same time, we'll work through a simple example that illustrates each of them.

Corpus

A corpus is a collection of digital documents. This collection is the input to gensim from which it will infer the structure of the documents, their topics, etc. The latent structure inferred from the corpus can later be used to assign topics to new documents which were not present in the training corpus. For this reason, we also refer to this collection as the training corpus. No human intervention (such as tagging the documents by hand) is required - the topic classification is unsupervised.

For our corpus, we'll use a list of 9 strings, each consisting of only a single sentence.


In [8]:
raw_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

After collecting our corpus, there are typically a number of preprocessing steps we want to undertake. We'll keep it simple and just remove some commonly used English words (such as 'the') and words that occur only once in the corpus. In the process of doing so, we'll 'tokenize' our data. Tokenization breaks up the documents into words (in this case using space as a delimiter).


In [9]:
# Remove common words
stoplist = set('for a of the and to in'.split(' '))
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in raw_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
processed_corpus


Out[9]:
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Before proceeding, we want to associate each word in the corpus with a unique integer ID. We can do this using the gensim.corpora.Dictionary class.


In [10]:
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)


Dictionary(12 unique tokens: ['human', 'interface', 'response', 'computer', 'trees']...)

Vector

To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector. There are various approaches for creating a vector representation of a document, but a simple example is the bag-of-words model. Under the bag-of-words model each document is represented by a vector containing the frequency counts of each word. For example, a document consisting of the string "coffee coffee milk sugar" could be represented by the vector [2, 1, 1] where the entries of the vector are (in order) the occurrences of "coffee", "milk" and "sugar" in the document.
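The bag-of-words idea can be sketched in a few lines of plain Python. The `bag_of_words` helper and the fixed `vocabulary` below are illustrative only (they are not part of gensim's API); gensim's `Dictionary.doc2bow` plays this role in practice:

```python
from collections import Counter

# A hypothetical vocabulary; its order fixes the position of each
# word's count in the resulting vector.
vocabulary = ["coffee", "milk", "sugar"]

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

print(bag_of_words("coffee coffee milk sugar", vocabulary))
# [2, 1, 1]
```

Words outside the vocabulary are simply ignored, just as gensim ignores tokens that are not in its dictionary.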

Our processed corpus has 12 unique words in it, which means that each document will be represented by a 12-dimensional vector under the bag-of-words model. We can use the dictionary to turn tokenized documents into these 12-dimensional vectors. For example, suppose we wanted to vectorize the phrase "Human computer interaction" (note that this phrase was not in our original corpus):


In [11]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
new_vec


Out[11]:
[(0, 1), (2, 1)]

The first entry in each tuple is the ID of a token in the dictionary; the second is the number of times that token occurs in the document. We can see what these IDs correspond to:


In [12]:
print(dictionary.token2id)


{'human': 0, 'interface': 1, 'response': 5, 'computer': 2, 'trees': 9, 'graph': 10, 'system': 6, 'user': 4, 'survey': 7, 'time': 3, 'minors': 11, 'eps': 8}

Note that "interaction" did not occur in the original corpus, so it is not included in the vectorization. Also note that this vector only contains entries for words that actually appeared in the document. Because any given document will contain only a small subset of the words in the dictionary, words that do not appear in the vectorization are implicitly zero, which saves space.
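This implicit-zero convention is easy to make concrete. The `to_dense` helper below is a hypothetical illustration, not part of gensim's API (gensim's `gensim.matutils.sparse2full` performs the same expansion, returning a NumPy array):

```python
def to_dense(sparse_vec, num_terms):
    """Expand a gensim-style sparse vector [(id, count), ...] into a
    dense list where missing IDs become explicit zeros."""
    dense = [0] * num_terms
    for token_id, value in sparse_vec:
        dense[token_id] = value
    return dense

# The sparse vector for "Human computer interaction" from above,
# expanded against our 12-token dictionary:
print(to_dense([(0, 1), (2, 1)], 12))
# [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

For our toy dictionary the dense form is harmless, but with a realistic vocabulary of hundreds of thousands of tokens the sparse form is dramatically smaller.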

We can convert our entire original corpus to a list of vectors:


In [13]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
bow_corpus


Out[13]:
[[(0, 1), (1, 1), (2, 1)],
 [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(1, 1), (4, 1), (6, 1), (8, 1)],
 [(0, 1), (6, 2), (8, 1)],
 [(3, 1), (4, 1), (5, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(7, 1), (10, 1), (11, 1)]]

Note that while this list lives entirely in memory, in most applications you will want a more scalable solution. Luckily, gensim allows you to use any iterator that returns a single document vector at a time. See the documentation for more details.

Model

Now that we have our vectorized corpus we can begin to transform it using models. We use model as an abstract term referring to a transformation from one document representation to another. In gensim documents are represented as vectors so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.

One simple example of a model is tf-idf. tf-idf transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.

Here's a simple example. Let's initialize the tf-idf model, train it on our corpus, and use it to transform the string "system minors":


In [14]:
from gensim import models
tfidf = models.TfidfModel(bow_corpus)
tfidf[dictionary.doc2bow("system minors".lower().split())]


Out[14]:
[(6, 0.5898341626740045), (11, 0.8075244024440723)]

Note that the ID corresponding to "system" (which appeared in three of the training documents) has been weighted lower than the ID corresponding to "minors" (which appeared in only two): tf-idf down-weights words that are common across the corpus.

gensim offers a number of different models/transformations. See Transformations and Topics for details.